Let's get started. But before I get started, I do want to give one final warning to you guys.
This is a talk about defending. I'm not breaking anything. You might have stumbled into the wrong
room. So I'm trying to build something here. This is not like an Android malware thing that
people are going to patch tomorrow. I'm trying to make something that could potentially change
the game in a few ways and help a few problems that are very, very dear to my heart when it
comes to defending blue teams and things like that. Also, there's a lot of math involved. I mean,
not so much that people would start rioting and throwing things at me. But if you haven't had
lunch yet, you might experience some difficulties. It's not good to do math on an empty stomach. But
anyway, if nobody is leaving, I guess we should get started then.
Just want to give just a quick note to where I'm coming from. And a bit of this presentation is
going to be a bit ranty. But I'm going to be talking about some of the things that I'm going
to do on my home. Because this experience, I've basically been in information security for like
12 years. And one of the reasons why you have never ever heard of me is because I did most
of this in Brazil. So I used to work with one of the largest information security consultancies there.
And where I come from, what I did the most was actually leading SOCs. Putting
security operation centers together, building SIEM solutions, building log management solutions, either
inside the company or in these huge, massive SOCs you guys see in the movies and of course
in all your own organizations, where we're trying to make the day to day of log
management and defending work. And after I was done with that, I got into machine learning
and I thought it would be a very, very interesting subject and I was quick to try to find ways
that this could potentially help the problem that we have been facing. Okay? And this is
my first presentation. I do want my shot. Okay? So ‑‑ man, perks of the job, man.
Perks of the job. Don't tell my employer, but I used to drink before the presentations
all the time. That's the only way I could get ‑‑ excuse me. Anyway, I'm going to
talk about it. Here's the overall structure. Right? I'm going to talk about the problem.
I'm going to talk about machine learning so that ‑‑ I want to be sure that we are
in the same level here. I don't want to dazzle you with some math shit and tell you it works.
Okay? I just want you to get the basic feeling. Of course this is not an introduction to machine
learning 101. We would have to be here like four hours talking about this. Not at all. I don't
have the time. It's not that cool anyway. But I just
want to make sure we understand where machine learning is coming from, what the special
considerations are for information security, which is something that's very, very important.
And most machine learning experts or data scientists or however you may call them, they
will not understand this point. And a little bit of the case study that actually goes through
the process of one of these algorithms that I have designed. Okay? So let's get moving
because we have a lot of ground to cover. First of all, you've all heard of log management,
right? You've got a bunch of logs, you've got the certifications, you've got the best
practices. It's everywhere, you know? You just have to deal with it. Everything is generating
logs for you. But the problem I have is, and bear in mind I've worked with a lot of SIEM
solutions and log management solutions over the time. And nobody's happy. And everyone's
deploying this. They have the state of the art. They're doing everything they can. And
everybody ‑‑
And I say everybody. You can quote Gartner. 93% of organizations had breaches they couldn't
catch with the state of the art deployment that they had. And everyone is unhappy and
looking for other options. I mean, it can't be all that bad. What's going on here? You
know? I got this great graph from SANS. And I do apologize they don't have the horizontal
axis markings. I have no idea what those numbers are. But that first one seems big. Okay?
So I will take their word for it.
So but, anyway, what really caught my eye here was this. Is that what's the biggest
problem in SIEM deployments? Well, it seems I cannot identify key events from normal behavior.
Maybe it's VA speed, or something, but the key events from normal behavior actually
They have correlation rules, they do all kinds of fancy shit, they put some graphics on,
you know, try to pretend that you understand what the logs are talking about.
That's that.
Anyway, the point here, I cannot tell ‑‑ my biggest problem is that I cannot tell key
events from normal behavior.
It's like if you bought a car and your key complaint is that I can't seem to get from
point A to point B. I know the car doesn't start, I cannot find the accelerator.
I mean, this is what we're buying these things for and we're not able to make them work.
One of the reasons why I believe we cannot make them work is because how do you use these
things?
Okay, you're already coming up with some magic numbers where, okay, I want to be alarmed
if someone starts scanning me if they hit me like five times.
Okay, so if I hit you four, I guess I'm home free, right?
Absolutely no problem with that.
Or I'm trying to make up these arbitrary rules where if something happened and then other
things happened, okay, that's an alert.
That's something that should be brought up to me.
And in a way, it's all permutations of this.
All these 300 and I don't know how many rules people put together, it's all that, you know?
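To make the "magic number" complaint concrete, here is a minimal sketch of a threshold rule of that kind. The threshold of five comes from the example above; the addresses (documentation ranges) and the function name are made up for illustration and do not come from any real product.

```python
from collections import Counter

SCAN_THRESHOLD = 5  # the arbitrary "magic number" baked into the rule


def alerts(blocked_hits, threshold=SCAN_THRESHOLD):
    """Return the source IPs whose blocked-connection count reaches the threshold."""
    counts = Counter(blocked_hits)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Illustrative data: one source hits five times, another stops at four.
hits = ["198.51.100.7"] * 5 + ["203.0.113.9"] * 4
flagged = alerts(hits)
print(flagged)  # the source that stopped at four hits never shows up
```

Anyone who knows or guesses the number simply stays one hit under it, which is the weakness being described.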
And when you're on an engagement, it's actually ‑‑ you're just iterating on this until the
customer is satisfied and you see the strike there or you run out of money, which is much,
much more likely.
And I don't know.
It sounds like a Ponzi scheme to me, you know?
Like this thing was built in order to make sure that we generate the most consulting
out of that.
And I don't claim that SIEM solutions are the hugest offenders.
If any one of you ever worked with identity management, okay, these guys, they take the
crown.
But anyway.
Anyway.
And this is the point I want to make.
So one of these vendors, they have like this huge ass curriculum, right?
You get one month of training.
I want to be an expert in this solution.
You get one month of training, four weeks, 20 days.
There's not a single hour of this training that is about explaining to you what is the
content that comes with it.
It will teach you four weeks on how to build new rules, on how to create new dashboards,
on how to create new things, but, I mean, aren't you telling me that you have all these
amazing dashboards?
How do I use them?
How do I apply them to what I need to do?
So it seems that things are harder than they should be.
And I don't want to touch on behavior rules just yet.
I'll talk a little bit more about them later on.
They do help a little with this configuration.
But I'm not really here to bash the tools, right?
So I mean, some people are very, very good at log management.
You see?
That's my log management.
And ‑‑ but they are very few.
So, I mean, some people are very, very good at log management.
Right?
I've worked with teams which are very, very good.
You give them like six months, you know, they would make this thing sing.
But the problem I have is that this does not scale.
We do not have enough people who are good enough at this, at understanding all of that.
And those are fantastic tools if you want to build something.
But they're not ready.
And most people buy them thinking that they will solve a problem, they will be ready.
And it's not.
It's not there.
It's definitely not there.
What really, really got me scared was big data.
Big data.
Because, I mean, the point is we have these smallish databases, okay?
So if you think about SIEM solutions, anyone here who is in general IT, this is nothing
more than a highly vertical business intelligence data warehousing solution.
I mean, from the 90s.
The 90s, when they hadn't invented columnar databases yet, right?
But now they are starting to catch up.
And now you're going to have integration.
You're going to be ingesting petabytes of data per day.
Who's going to do that?
Who's going to analyze this?
We got to create the whole rules ourselves?
I mean, usually when you handle this kind of data, you really have to work with statistical
analysis.
You've got to know this.
It's a whole different discipline from your traditional information security analysis.
And I'm not saying ‑‑ I've met people who are very good at both.
But that makes the pool even smaller, okay?
And I mean, given all of this, the only solution I can come here at this stage and tell you
about is that we need an Army, right?
Let the robots talk to themselves, right?
I mean, if there's a machine generating data, let's have a machine read it.
Reading the data for us.
We're hopeless.
We're not going to be able to keep up.
There's absolutely no way.
But that's when we start talking about machine learning, okay?
And I want to make sure that we get the basics right so that we can understand a lot of what
we can and can't do, okay?
We really haven't gotten to the Skynet stage as of now.
I know there's a little bit of discussion, although I've met some people who are really
intent on building this, yeah, let's put all this machine learning thing together.
It's going to be awesome.
I'm not so sure.
But the main point, when you're thinking about machine learning, is that you're not writing
the code.
You're not really writing the code for the decision‑making.
You're writing code that will identify some data for you and then it will start making
inferences based on the data, right?
So it's pretty much as if I went to the computer and I said, computer, this is a chair, chair,
okay?
And then the computer will look at the chair.
Okay.
Yeah.
It's got some metal rims.
It's got this stuffing on the back.
Got it.
It's a chair.
But okay.
This is another chair.
Oh.
That's a bit different.
But okay.
I think I got the idea.
And as you show enough chairs to the computer, it will eventually be able to generalize what
a chair is, okay?
And the secret is all in how you tell the computer what a chair is.
And that's the real, in a way, science slash art.
The art around this is how do you build what we call the features in order for these algorithms
to be able to take this in and make decisions based on what they saw, if something is this
or that.
Okay?
I'm going to give a lot more examples.
And this is everywhere, okay?
Absolutely everywhere.
And I think it's a shame that we don't use this more in information security.
So everyone is selling you shit out of this.
You go to Amazon.
Okay.
Amazon is one of the big examples here.
They'll use this thing, this technique called collaborative filtering, which will pretty
much say, okay, if you bought these things, and there's a lot of people just like you
who bought these things, if they bought something else, I'm probably going to suggest it to
you.
You guys are probably pretty similar.
So it will make decisions.
If you look at the math,
they have like a billion-row matrix here, a billion-column matrix here, and they just multiply
the hell out of that shit.
They come up with you should be reading this book.
It's actually quite awesome.
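As a toy illustration of "multiply the matrices and a suggestion falls out", here is a minimal item co-occurrence sketch. The four-by-four purchase matrix and the item names are invented, and real recommenders are far more elaborate than this; it only shows the shape of the computation.

```python
# Rows are customers, columns are items; 1 means "bought it". All data is made up.
ITEMS = ["firewall book", "python book", "stats book", "cookbook"]
PURCHASES = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def recommend(basket):
    """Suggest the unowned item most often bought together with the basket."""
    # Item-by-item co-occurrence counts: (purchases transposed) times (purchases).
    co_occurrence = matmul(transpose(PURCHASES), PURCHASES)
    # Score every item against what this customer already owns.
    scores = matmul([basket], co_occurrence)[0]
    candidates = [(s, name) for s, name, owned in zip(scores, ITEMS, basket) if not owned]
    return max(candidates)[1]


print(recommend([1, 1, 0, 0]))
```

People who bought the first two books mostly also bought the stats book, so that is what gets suggested, while the cookbook buyer has nothing in common with this basket.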
But I mean, this is something that has been studied.
It's something that's very well understood by the marketing community and the sales
community.
When you talk about trading, and this is a very good picture, which is one of the cautionary
tales here.
This is the flash crash that happened, I think it was two or three years ago, where
actually most of our high frequency trading right now that's run by these large quantitative
funds, it's algorithms.
They're just talking to each other.
And they're selling and buying based on what's happening.
And some of them got confused there and almost crashed the whole economy.
But I mean, someone was watching the monitor, no, no, no, wait, wait, just back up a little
bit.
Okay.
You guys play nice now.
Let's separate this fight.
But it's everywhere.
You know, this is a very sensitive place, what it is.
On the other hand, you've got people doing some really, really cool stuff with image
and voice recognition.
So this is actually a picture ‑‑ that's good.
So, yeah.
Okay.
Hello.
Should I continue?
No.
So we ‑‑ evidently he's the only one that does not know why we are here.
So what are we called?
Shot the noob.
Shot the noob.
That's right.
So your first time speaking at DEF CON?
Yes, sir.
Congratulations.
Thank you.
We would like to shout some alcohol down your throat.
We need someone from the audience.
Oh, right.
Raise your hand if you're a first timer.
You in the blue shirt.
Because you were faster than everybody else.
Sunday morning at DEF CON.
You've got to love it.
Thank you.
What is this shit on the screen?
Oh, my God.
It's an evil machine learning.
Visualization.
That is really cool.
All right.
Data visualization.
And everybody, first time at DEF CON.
Cheers.
Cheers.
We'll see you soon.
Okay.
I'm happy now.
I guess we can continue.
Anyway.
What's the scare about?
So the guys from Google, they actually set up like a 16,000-core cluster, and they
told the machine, okay, find me cat pictures on the Internet, right?
They really know their audience, right?
So the ‑‑ And it did.
So they used a technique that's called deep learning, which actually creates, I don't
know, an arbitrarily deep and complicated neural network out of the blue.
I don't claim to understand it, but it's mighty fancy.
And anyway, this is a visualization of what a computer thinks a cat is.
That's awesome.
I mean, I can see a cat there.
Man, this is the future, man.
The computer is watching cat videos now.
This is ‑‑
Anyway, on a more serious note, okay?
What is this being used for in security right now?
So a lot of the fraud detection systems, they will use some type or another.
The most basic technique they use is called clustering, which I'll talk a little bit
more about further on.
But they're trying to find deviations from a pattern.
So if they can create a baseline and identify where you are in a group of customers, they
would be able to see if you're not you in the dimensions that they can look
at.
So if you used to do this kind of shit when you're using your credit card, if you do something
very different, okay, that's probably a flag.
And here's where I touch on behavior monitoring.
Behavior monitoring, no matter what people tell you, it's not machine learning.
You know?
My economics calculator, you know, the HP 12C, can do rolling averages.
You know?
It's not learning.
Rolling averages are very easy to do.
So statistical analysis is helpful, okay, but it's a first step in understanding maybe
a bigger scope, but that's not machine learning in any shape or form.
And finally, spam filters, which are the unsung heroes of machine learning, you know?
You remember the Bayesian filters?
Yeah.
That's it.
That's actually ‑‑ that's actually the algorithm that they use.
And the point I'm trying to make here is that how many talks did you see this year
or the past two years about spam?
Nobody seems to be doing research on that anymore.
I know I have my ‑‑ I have my Gmail account.
I opened it in 2004, and it's a long time ago.
I mean, I'm pretty sure every single spammer on the Internet has a hold of my account.
I don't see spam.
I don't get spam.
Do you get spam on Gmail or something like that?
I mean, I do get all the crap I signed up for, you know?
I do get phishing e‑mails, of course, but phishing e‑mails, they are specifically
crafted to look like a normal e‑mail to get past all of this.
This is a problem that we don't really look at anymore because it took, I don't know,
ten years, maybe 15 years.
Actually I found ‑‑ I was doing research for another talk, I found a paper from 98,
that talks about some techniques of using Bayesian learning for spam filters.
It seems like the problem is solved in a way.
And that's one of the things ‑‑ one of the messages that I want to leave there.
Okay?
Maybe the work ‑‑ if we start doing this work now, maybe we will get very good at picking
up this stuff in about five to ten years.
And this is the power.
Once you have enough data ‑‑ and arguably Google would be the one who has the most,
most data.
We probably can agree on that.
You can get pretty good at this.
Anyway.
Now we start getting a little bit more technical.
So the idea here is that there are two big kinds of machine learning.
You get your supervised learning, where you are actually telling the computer or the program
what these things are, okay?
And there are two major groups. One is called classification, where I'm telling it, okay, this
is a chair.
This is a table.
This is a chair.
This is a table.
A bunch of chairs.
A bunch of tables.
Okay.
Here's an object.
What is it?
Is it a chair?
Is it a table?
So you're pretty much giving it data to train on, okay?
And you're giving it labels, which is this is a chair, this is a table.
And then you present new data in order for it to predict for you, okay?
And that's the word that we use.
Whether it's a chair or it's a table.
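A minimal sketch of that train-then-predict loop, assuming two invented features (height in centimeters and whether there is a backrest) and a nearest-centroid rule standing in for whichever classifier you would actually pick. The numbers are illustrative only.

```python
import math

# (height in cm, has a backrest) with the label we give the computer.
TRAINING = [
    ((45.0, 1.0), "chair"),
    ((47.0, 1.0), "chair"),
    ((43.0, 1.0), "chair"),
    ((75.0, 0.0), "table"),
    ((72.0, 0.0), "table"),
    ((78.0, 0.0), "table"),
]


def train(examples):
    """Learn one average feature vector (centroid) per label."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / count for t in total] for label, (total, count) in sums.items()}


def predict(model, features):
    """Answer with the label whose centroid is closest to the new object."""
    return min(model, key=lambda label: math.dist(model[label], features))


model = train(TRAINING)
print(predict(model, (46.0, 1.0)))
print(predict(model, (74.0, 0.0)))
```

Nothing in `predict` mentions chairs or tables; the decision comes entirely from the labeled examples, which is the point being made.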
When we think about regression, we're actually not looking for a binary answer.
Okay?
But we're looking for, okay, how much of this is a table and how much of this is a
chair?
I know it doesn't seem to make sense, but if you look at a stool or maybe one of those
benches you have for pianos, you know, that actually look like a table, but it's like
high, the computer would get confused.
It's a .4 chair, you know?
Yeah.
I mean, it's a very general example.
But this is the kind of stuff that we do.
We're trying to analyze where this lies.
And based on this, we can either use this to, okay, no, this is definitely a chair or
definitely a table.
Or we can use this data to make the humans better informed about the decisions they
should take.
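To show how a "0.4 chair" can come out of a model, here is a minimal sketch that fits a straight line from height to a chair score by ordinary least squares. The heights, the 63 cm stool and the single-feature setup are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Heights in cm; a score of 1.0 means "definitely a chair", 0.0 "definitely a table".
heights = [43, 45, 47, 73, 75, 77]
chair_score = [1, 1, 1, 0, 0, 0]

slope, intercept = fit_line(heights, chair_score)
stool_score = slope * 63 + intercept  # a hypothetical 63 cm stool
print(round(stool_score, 2))  # roughly 0.4
```

The output is a number in between rather than a hard label, so you can either threshold it or hand it to a human as supporting evidence.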
And then you've got unsupervised learning, which is, okay, I don't know anything about
this.
Look at this and tell me what you can find.
All right?
All right?
So you have two big groups as well.
You have what's called clustering, which is the one I mentioned before in fraud solutions,
which I mean, it's by far the most abused machine learning technique for anything ever.
Okay?
So it's not always applicable.
And it has a very, very fatal flaw: you have to actually tell the computer how many
clusters you're looking for,
so that it can guess
what would be the separation of the data that you send it.
So of course there are techniques to discover that.
But then it starts getting very complicated.
And most of the people don't actually do all this leg work.
So it can get a little bit fuzzy.
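A minimal one-dimensional k-means sketch makes the "you have to tell it how many clusters" problem visible. The transaction amounts are invented, and the evenly spaced starting centers are a simplification chosen to keep the example deterministic.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny k-means on plain numbers; k has to be supplied by the caller."""
    data = sorted(values)
    if k == 1:
        centers = [sum(data) / len(data)]
    else:
        # Deterministic start: k evenly spaced points from the sorted data.
        centers = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in data:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


amounts = [9, 10, 11, 50, 52, 200, 210]  # made-up transaction amounts
print(kmeans_1d(amounts, 2))
print(kmeans_1d(amounts, 3))
```

With `k=2` the small and medium amounts are lumped together; with `k=3` they separate. The data did not change, only the number you fed in.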
And finally, decomposition, which in a way is a tool for you to design better algorithms
in a nutshell.
So you've got ‑‑ I've got a chair, okay?
And I need to tell the computer what a chair is, and I have all these different things
that I could tell the computer about.
Its height, what's it made of, and all these things.
And so I get like, I don't know, a hundred possible variables out of a chair.
And I'm trying to find which ones matter the most in deciding if something is a chair
or it is a table.
Okay?
It could be argued maybe that for chair and table it could be the height, you know?
So probably if I ran this chair-table model through a decomposition thing (the PCA there
is principal component analysis) and asked it to tell me
which are the guys who are really making a difference here,
it would probably tell me that the height is one of the guys that I should definitely
choose for my model because it really makes an awesome difference.
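Here is a minimal PCA sketch for the chair and table example, using power iteration on the covariance matrix to find the first principal component. The two features (height and width in centimeters) and the data are invented. One caveat worth stating: PCA ranks directions by how much the data varies along them, not by the labels, so it is a guide for picking features rather than a proof that a feature separates the classes.

```python
import math


def first_component(rows, iterations=100):
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# (height cm, width cm): three chairs, then three tables. Made-up numbers.
FURNITURE = [(44, 50), (45, 48), (46, 52), (74, 49), (75, 51), (76, 50)]
loadings = first_component(FURNITURE)
print([round(x, 3) for x in loadings])
```

Almost all of the weight of the first component lands on height, because that is where the furniture actually differs.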
Anyway, by the way, if you like this, this is ‑‑ this tutorial is awesome.
So it's not magic.
You still have to train the computer.
You still have to give it data.
But one of the basic principles of machine learning is that if you design your algorithm,
if you design your model well, it will generally get better with more data.
Okay?
So there's a lot of mathematical proofs that you can show that the more data you have,
and this is the drawing on the left, your E-in, which would be the error you have inside
your training set, which is the data that you know you're using to train,
and the E-out, which is the error on the stuff you've never, ever seen before in
your life, will both converge to an expected error.
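One standard result of this kind, given here only as an illustration and not necessarily the one behind the drawing, is the Hoeffding bound with a union bound over a finite set of candidate models. The symbols are assumptions for this note: N is the number of training examples, M the number of candidate hypotheses, g the one that gets picked, and epsilon the tolerance.

```latex
\[
  \Pr\bigl[\,\lvert E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)\rvert > \epsilon\,\bigr]
  \;\le\; 2M\,e^{-2\epsilon^{2}N}
\]
```

The right-hand side shrinks exponentially as N grows, so with more data the error you measure in training becomes a better estimate of the error you will see on new data.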
And I mean, don't let anyone tell you that they have a 100% working machine learning model
because there's no such thing.
It will make mistakes.
We make mistakes as we look at it.
And it's all part of the way that we make sense of reality, and we're trying to emulate
that into a computer.
There will always be errors, and you can always hope and work your model for it to be the
least that it can be.
But the point is, you have to be careful of what data you're taking in.
Okay?
So I'm going to give you guys as an example what I've used, the SANS DShield data, which
is this data that you get from them about firewall blocks.
And I mean, they wouldn't tell me, but I would assume that it's pretty U.S. centric.
So there is definitely this bias on the results that I'm going to show you guys.
But you always have to be cognizant of that, and you always have to look out for adversaries.
And this is one of the points I wanted to make in this talk, that everyone who is into
building these models and they do the machine learning, they do not understand that people
would like to fuck with them.
Okay?
So, I mean, we understand that in a very deep personal level.
And one of the first things that came to my mind is, okay, even if I build something here,
how am I going to deal with it?
How can I potentially exploit it?
How can I send some random noise and things like that, and that will render this completely
useless?
And I'm going to talk a little bit about that as one of the weaknesses of this specific
model that I built.
But I mean, it is there.
Right?
People will try to mess with you.
And if this ‑‑ and I really believe it will.
If this becomes a valid method of actually defending and helping defense, there will
be a lot of talks in this conference with some really crazy math guys on how they will
defeat this.
It will be the crypto wars all over again or something like that.
But anyway, the point I tried to make here, just to exemplify this, remember spam engines,
all right?
So when we started getting this Bayesian thing going on, the spammers just started posting
whole sonnets of Shakespeare at the end.
You know?
And then the model would think, hmm, I haven't really seen many spams with the word "thou".
So that's probably a legitimate e‑mail.
And that's pretty much what it was.
That's pretty much ‑‑ and then people refined that and people made sure that they
would understand this sort of thing.
They evolved the model as well.
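The padding trick can be reproduced with a minimal naive Bayes sketch. The tiny training corpus is invented, and it assumes the legitimate mail already contains some literary text, which is what makes the borrowed words look innocent to the model.

```python
import math
from collections import Counter

# Made-up training mail.
SPAM = ["buy cheap pills now", "cheap pills buy now today"]
HAM = [
    "shall i compare thee to a summers day",
    "thou art more lovely and more temperate",
    "see you at lunch today",
]


def word_counts(messages):
    return Counter(word for m in messages for word in m.split())


SPAM_COUNTS, HAM_COUNTS = word_counts(SPAM), word_counts(HAM)
VOCAB = set(SPAM_COUNTS) | set(HAM_COUNTS)


def log_likelihood(word, counts):
    # Laplace (add-one) smoothing so unseen words do not zero everything out.
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(VOCAB)))


def spam_log_odds(message):
    """Positive means 'looks like spam', negative means 'looks legitimate'."""
    score = math.log(len(SPAM) / len(HAM))
    for word in message.split():
        score += log_likelihood(word, SPAM_COUNTS) - log_likelihood(word, HAM_COUNTS)
    return score


plain = "buy cheap pills"
padded = plain + " " + " ".join(HAM[:2])  # same pitch, sonnet lines appended
print(round(spam_log_odds(plain), 2), round(spam_log_odds(padded), 2))
```

The sales pitch is identical in both messages; the appended lines alone drag the score to the legitimate side, which is why the models had to evolve.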
And we are getting to the level that we're getting to.
But people will always find a way to break things.
And I think especially on these kind of applications, this is something that we have to be very,
very careful about.
Anyway, enough introduction.
Let's get to it.
Okay?
Let's chew on the logs.
And the idea, most of the talk here is about the feature engineering, which
is the really important part here.
There are ‑‑ on that slide about the kinds of machine learning, there were some
names of algorithms.
There's a whole bunch of them.
And you just try them all.
You know?
You create a process for you to try and see which works best for your data because there's
no one size fits all.
The problem is what data you're feeding it.
And that's the difficult selection process and what everyone will tell you that's the
real hard work.
Anyway, I was telling you about DShield.
And what I did with them is I've been collecting their bulk logs.
So if you go to their website, you'll have a lot of top ten ports that are being attacked,
top ten places bad people are coming from on the Internet and things like that.
But if you ask really nicely, and they were completely awesome about that, I really would
like to thank the SANS Institute for their help here.
You can get the bulk data.
And I just started mining it starting in January.
I got like seven months of shit.
And this is very basic stuff.
And this is one of the points I'm trying to make.
This is firewall data, blocked firewall data, okay, and it's summarized.
I get to know how many kind of blocks we got for each one, which I use as one of the ways
to decide who I'm going to select to the model.
But that's all there is, right?
But you know that for this group of people who are submitting the log files, these are
the guys who were potentially attacking them, right?
I mean, you always start with a port scan.
You always start with something like that to see which machines are up.
So if people were hitting them on the firewall, and this is one of the points I'm trying to
make, on machines that didn't have that port open, well, maybe if they hit the right port,
you know what I mean?
That would be an issue.
That would be something worth looking at.
And this is just a summary of the amount of data, okay?
So roughly by day, I got a million observations.
An observation is pretty much this IP address attacked this port, okay?
So from that, I would select the behaviors that I would be looking for.
And when you summarize that ‑‑ sorry, when you decompose that, and I didn't get
the decomposed logs.
I just got a number there.
You would get roughly 30 million log events per day.
And the thing I wanted to point out here is that this is
not big data at all.
This is like nothing.
I'm running this on my laptop, and it's not even a good one at that, okay?
So don't let anybody tell you that, okay, oh, my God, it's the cloud, you know?
So yeah, you know what it is.
So.
So I'm really trying to bring this to a tangible level.
And this is one of the objectives I have in this talk.
So I'm just getting some data, and I'm doing mining with it on my own laptop.
So one of the intuitions here of this model is the proximity.
And anyone who has ever done real SOC work, they develop ‑‑ they kind of develop
this instinct where, okay, I've seen this.
I've seen this IP address before.
Or I've seen people coming from that side of the woods before.
And even ‑‑ and it's interesting, because I've seen a lot of ‑‑ I've seen a lot
of things being caught out of mistakes.
So people would look at IP addresses, oh, I remember this guy.
Let's look at this.
But actually, no, the guy had never appeared.
It was something that was ‑‑ an IP address that was similar to that one, okay?
And.
Yeah, there's actually something there.
And so I started doing some research on that, and there's actually some anecdotal evidence
and some really, really hard statistical analysis facts about this.
And one of the things I like to point out is the Spamhaus/CyberBunker thing.
And I don't mean the DDoS.
I don't care about that.
But I care about the fact that Spamhaus actually stood up and said, okay, forget about
that ISP.
Anything that comes from there, it's bad.
They're just going to try to spam you.
Just block the whole thing.
I can't take it anymore.
That's an interesting conclusion.
And they really fight to be ‑‑ how can I say ‑‑ level‑headed.
Sorry?
I don't know anyone ‑‑ I don't know anyone there.
I can't really comment on that.
We'll have to take your word.
So ‑‑ but then you've got the Google report.
The Google malware report where they said it blatantly, okay, there are places where
they're more likely to have malware than not.
And there's this paper from a researcher in Brazil called something more ‑‑ I forgot
his first name.
He actually did a statistical analysis for the past, I don't know, seven years of logs
from the Brazilian research network, the one that connects the universities.
And he was like ‑‑ he was proving that shit like statistically all the time.
Man, this is not random at all.
There is some information we can potentially extract.
So then what I started doing to create my features, I started grouping those logs by
arbitrary net blocks, okay?
Let's pick some things, okay?
Let's choose and see if it comes out all right.
And of course by ASN, which is what really shines, which is pretty much the
autonomous system they're coming from on the Internet.
And for that I use a lot of Team Cymru services.
They have an awesome whois service that you can poll and things like that.
So here's a visualization of that, okay?
And I know it seems like a lot to take in.
And that's understandable because that's the Internet up there.
And the point here is that this is a projection that tries to maximize, okay, the proximity
as you draw the IP addresses.
So I put the drawing there.
This is called the Hilbert curve.
Okay?
And this is the kind of transformation that it's trying to do.
If you think of the IP addresses as if they were on a straight line, okay, from 0.0.0.0 to
255.255, blah, blah, blah, I'm actually twisting this like this to make sure that
the IP addresses are as close as possible to their neighbors, okay?
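The twisting can be sketched with the usual distance-to-coordinate conversion for a Hilbert curve. Placing each first octet (each /8) in one cell of a 16 by 16 grid is an assumption made to keep the example small; the actual map may well use a finer resolution.

```python
def hilbert_d2xy(side, d):
    """Map position d along a Hilbert curve to (x, y) on a side-by-side grid.

    `side` must be a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before swapping the axes.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def cell_for_ip(ip):
    """Place an IPv4 address on a 16x16 grid by its first octet (its /8)."""
    first_octet = int(ip.split(".")[0])
    return hilbert_d2xy(16, first_octet)


cells = [hilbert_d2xy(16, d) for d in range(256)]
print(cells[:4], cell_for_ip("198.51.100.7"))
```

Consecutive positions on the line always land in neighboring cells, so numerically adjacent address blocks stay next to each other on the picture.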
And this is actually accumulated data until the 20th of July on people that
were trying to attack the DShield group on port 22.
Okay?
I don't think that's random.
And you can see ‑‑ of course, some places on this map, they are dark by nature.
The DOD is not really doing anything there, though they have like two slash eight blocks,
you know?
Even IBM, they have a whole ‑‑ I think it's eight dot something.
They don't really use that.
But even on the other places, you can see that there's some density going on, okay?
And by the way, if you're wondering, if you're in the DEF CON network, you're there at the
start, right?
So ‑‑ I'm trying to make ‑‑ there is some clustering here.
And you can start ‑‑ if you look at this cluster, you can start to jump to conclusions,
right?
Oh, my God!
You know?
Look at these guys.
They're obviously up to something.
And then you start doing some more mining of this.
And oh, my God!
You know?
They're definitely up to something.
And I mean, this is data, okay?
But the point I'm trying to make here.
It's very easy to do this and start jumping to conclusions.
And we really have to see this thing to fruition at the end.
So if you look at it, what actually happens is that it's us, right?
We're pretty much beating the shit out of each other all the time.
So anyway, I just want to make a point that if you're just blocking for, oh, I hate this
country.
I'm not going to let it in.
That doesn't really mean anything.
You just have to be careful with that.
Anyway.
Let's get moving.
Okay.
So we get this.
There's proximity.
But I want to be able to decay that because the neighborhoods might renovate, okay?
So I want to make sure that if people are changing ISPs, they're changing their anonymous
proxies, I want to be able to keep up with that.
Otherwise, I just have a massive blacklist that goes on forever and I'm eventually going
to block the whole Internet off.
So as time passes, I want to be able to forget these things happened and do some sort of
exponential decay here, right?
So you choose your metric and you can see that, I mean, after a few ‑‑ after three
to four months, it doesn't really matter anymore.
You completely forgot that people were attacking you, which actually pretty much
mimics what an analyst would do, right?
You have so much memory, right?
You go out to party and things like that.
So if something happened a month ago, you're not as likely to remember this as if
it happened yesterday.
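A minimal sketch of that forgetting curve. The 14-day half-life is a hypothetical choice of metric, picked only because it makes an event count for less than one percent after about four months; the talk does not state the actual parameter.

```python
HALF_LIFE_DAYS = 14.0  # hypothetical: the weight of an event halves every two weeks


def decay_weight(age_days, half_life_days=HALF_LIFE_DAYS):
    """How much an event that happened `age_days` ago still counts (1.0 = today)."""
    return 0.5 ** (age_days / half_life_days)


for age in (0, 14, 30, 60, 90, 120):
    print(age, round(decay_weight(age), 4))
```

Choosing the half-life is the "choose your metric" step: a shorter one forgets renovated neighborhoods faster, a longer one behaves more like a permanent blacklist.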
So given those two intuitions, we have to start calculating this.
Okay.
And the point here is that we create some sort of rankings, which are going to be the
features to our model by IP address, by some net blocks that you choose, and by the ASNs.
If you're missing data, I took a shortcut. I just said, okay, if I can't discover what
your ASN is or if you are a bogon, you're bad, man.
I'm just going to put you with a high score, which is a ‑‑ I mean, it's a shortcut,
but it's just to make sure that we don't ignore these people, which maybe would be
bad in a scenario like this.
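As a rough illustration of the three ranking levels and the missing-data shortcut, here is a sketch using Python's standard `ipaddress` module. Everything named here is an assumption for the example: the /24 netblock size, the value used as "a high score", the lookup tables, and treating any non-global IPv4 address as a bogon.

```python
import ipaddress
from typing import Dict

# Hypothetical values: the talk only says unknown ASNs and bogons get "a high score".
MISSING_DATA_SCORE = 1.0
NETBLOCK_PREFIX = 24  # assumed IPv4 netblock size; the talk leaves the choice to you


def rank_features(
    ip: str,
    ip_rank: Dict[str, float],
    block_rank: Dict[str, float],
    asn_rank: Dict[int, float],
    asn_of_block: Dict[str, int],
) -> Dict[str, float]:
    """Look up the per-IP, per-netblock and per-ASN rank for one IPv4 address."""
    addr = ipaddress.ip_address(ip)
    block = str(ipaddress.ip_network(f"{ip}/{NETBLOCK_PREFIX}", strict=False))
    asn = asn_of_block.get(block)
    if asn is None or not addr.is_global:
        # Shortcut from the talk: unknown ASN or bogon is scored as bad.
        asn_score = MISSING_DATA_SCORE
    else:
        asn_score = asn_rank.get(asn, 0.0)
    return {
        "ip": ip_rank.get(ip, 0.0),
        "netblock": block_rank.get(block, 0.0),
        "asn": asn_score,
    }
```

The returned dictionary is one row of features for the model, keyed by aggregation level.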
And the point is, for each day that we have, each day that happened based on the time series
that you saw, we're going to have a calculation.
So this is what happened this day.
So the following day, we will actually decay this by the function, the exponential function,
and then we add what happened the following day.
We're pretty much adding one.
So this happened this day, this happened that day.
And we go decaying the rest.
Okay.
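The day-by-day update just described (decay yesterday's number, then add one for whoever showed up today) can be sketched as below. The half-life and the function name are assumptions; the keys can be IP addresses, netblocks, or ASNs.

```python
import math
from typing import Dict, Iterable, List, Set

HALF_LIFE_DAYS = 15.0  # hypothetical; the talk does not state the constant
DAILY_DECAY = math.exp(-math.log(2) / HALF_LIFE_DAYS)


def daily_scores(
    hits_per_day: Iterable[Set[str]], decay: float = DAILY_DECAY
) -> List[Dict[str, float]]:
    """Return one score table per day, in the same order as the input days."""
    history: List[Dict[str, float]] = []
    scores: Dict[str, float] = {}
    for seen_today in hits_per_day:
        # Decay everything remembered from previous days.
        scores = {key: value * decay for key, value in scores.items()}
        # Then add one for every key that was seen today.
        for key in seen_today:
            scores[key] = scores.get(key, 0.0) + 1.0
        history.append(dict(scores))
    return history
```

Keeping the whole history, rather than only the latest table, is what lets each log line be matched later with the scores of the right date.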
So the importance of us having this history is that if we're using log data to detect
stuff, log data will have dates.
So I can't use calculations from the future with things from yesterday.
That doesn't make any sense.
The model won't make any sense.
And it will tend toward survivorship bias, which is thinking that something that happened in the
future influences the past, and that's a very rookie mistake that you can make with
this sort of thing.
So you just have to make sure that you're pairing the date of the features that you have with
the data that you have.
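One way to keep the future out of the features is to give each log line only the scores that were already known before its date. The sketch below assumes the scores from the previous day are used; the talk only says the dates have to match, so that offset and the names are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def pair_with_features(
    log_rows: List[Tuple[date, str]],
    scores_by_day: Dict[date, Dict[str, float]],
) -> List[Tuple[date, str, float]]:
    """Attach to each (day, ip) log row the score computed up to the day before."""
    paired = []
    for day, ip in log_rows:
        # Assumption: only data up to the previous day may describe this row.
        known = scores_by_day.get(day - timedelta(days=1), {})
        paired.append((day, ip, known.get(ip, 0.0)))
    return paired
```

A row with no earlier history simply gets a score of zero instead of borrowing a number from a later day.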
And this is just an example, right?
So the vertical scale is actually log scale.
So it's like when you see 6 there, it's like almost a million ‑‑ it's a million, right?
And if you get 1 on this score, which is the horizontal axis, it means you were hitting
me every single day.
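The talk does not say how the score is scaled so that 1 means "every single day". One plausible way, assuming the decay-then-add-one update, is to divide by the score of a source that hit on every day of the window. The names and the example numbers are hypothetical.

```python
def max_score(days: int, decay: float) -> float:
    """Score of a source that was seen on every one of `days` consecutive days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return sum(decay ** k for k in range(days))


def normalized_score(raw: float, days: int, decay: float) -> float:
    """Scale a raw score so that 1.0 means 'seen every single day'."""
    return raw / max_score(days, decay)


if __name__ == "__main__":
    # Three days of history with a (made-up) decay of 0.5 per day.
    print(normalized_score(1.75, days=3, decay=0.5))  # hit on all three days
    print(normalized_score(1.0, days=3, decay=0.5))   # hit only on the last day
```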
So I mean, I found it pretty interesting.
Okay.
At least 10,000 guys, this is RDP, that were hitting me every, every single day.
And even if I was using that as a black list, which is a perfectly valid way to do this,
look at all the guys that I'm ignoring.
That doesn't mean anything compared to the rest of the data that I have.
And I have another example here on 22, when you can see pretty much the similar behavior,
right?
There will be some guys who will be able to get you, but if you only look at the daily
blacklists, there's so much information you're leaving behind that you could potentially leverage
to help what you're doing.
We good?
Too much math?
Who fixed that?
Yeah.
So ‑‑ oh, God.
Man, I'm not going to finish like this.
I'm sorry.
Oh, no, no.
All right.
All right.
Enough with your math bulls ‑‑
It is time to drink.
I can respect that.
Wait a second.
Didn't we already do a shot with you this morning?
Yes, you did.
No, not this morning.
You just came to this room before.
I can do another one.
There's no problem.
He wants to do another one.
I'm sorry.
Is that a Russian accent I'm hearing?
No, it's Brazilian.
Really?
Yeah.
Oh, my God.
It's my room.
I say do it again.
Wait.
Wait.
How the ‑‑ maybe we just want to drink.
I don't know.
Someone from the audience.
Wait.
Wait.
What?
First time.
Maybe I saw you ‑‑
First time.
Right here.
No, you did come to this room.
And you've done a shot.
Yes.
This is the second time ‑‑
Fucking A.
That is awesome.
It's great.
All right.
For the second time.
Awesome.
We've got a few shots.
Wait a minute.
All right.
Wait.
What's your name?
Steve.
Steve.
Clearly we've lost track at this point.
Steve, this is everybody.
Hey, everyone.
Steve.
Steve.
First time attendee.
Here's to Steve.
Cheers.
Thank you.
I guess we'll be back in five minutes.
Okay.
Let's make sure we get close to finishing this.
The point here is we've got a bunch of numbers, right?
But we know with this bunch of numbers that these guys ‑‑
Back to the math.
Yeah.
Sorry, guys.
We should do a drinking talk.
Anyway.
Yeah.
Yeah.
So ‑‑
You're talking.
Oh, man.
We get this ‑‑ so we get these features calculated and we pair them with the data
that we had.
So the assumption here is if someone hit the firewall, they are out to get you.
And they should be considered bad.
Okay?
So you put these guys on your training model, so pretty much the IP address, the features
that you calculated, and you feed that to the ‑‑ you feed that to the model.
Okay?
And you usually have to take more than one data, more than one day, I'm sorry.
And that's why separating the ranks makes sense on daily basis because you want to make
sure you're pairing the right date with the right data that you're using.
All right?
So rules of thumb.
And it all depends on the algorithm that you're using.
Okay?
The point is you can't always have malicious data.
You have to have good data as well, otherwise the computer will just say, yay, everything
is a chair, man.
That was an easy job.
So the point is you've got to find something.
I did ‑‑ I took a bunch of IP addresses from Alexa, from Chromium to make sure that
we got enough data to pair it up with.
So it was at least 50‑50.
Okay.
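A small sketch of building a labeled training set with the 50-50 mix mentioned above. Downsampling the larger class to match the smaller one is an assumption, as are the names; the talk only says known-good addresses were added until the split was at least even.

```python
import random
from typing import List, Sequence, Tuple

Features = Sequence[float]


def build_training_set(
    malicious: Sequence[Features], benign: Sequence[Features], seed: int = 0
) -> List[Tuple[Features, int]]:
    """Return shuffled (features, label) rows with as many good rows as bad ones."""
    rng = random.Random(seed)
    n = min(len(malicious), len(benign))
    rows = [(f, 1) for f in rng.sample(list(malicious), n)]   # 1 = hit the firewall
    rows += [(f, 0) for f in rng.sample(list(benign), n)]     # 0 = known-good source
    rng.shuffle(rows)
    return rows
```

Without the good rows the model can reach perfect accuracy by calling everything bad, which is the "everything is a chair" problem.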
So as far as algorithms, I'm not going to go into this because I don't have enough time.
But support vector machines are awesome.
Okay?
They do math that people don't even understand about.
Okay?
It's like the mathematicians, well, I think there is an infinite dimension where this
makes sense.
I don't care.
I'm just going to calculate this in the matrix and it should sort itself out.
And it works.
Scary.
Scary as fuck.
Anyway.
Here's the point.
Okay?
So we train this every day.
Okay?
So I got a bunch of ports.
Okay.
What's happening on port 22?
What's happening on port 389?
What's happening on port 25?
And I would get something around 83 to 95% of training accuracy.
And accuracy is what I got right, so bad is bad and good is good, divided by the whole
number of things that I gave it to process.
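The accuracy definition just given, written out as a short function. Labels are assumed to be 1 for bad and 0 for good.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """(bad called bad + good called good) divided by everything processed."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


if __name__ == "__main__":
    # One bad source missed out of four rows.
    print(accuracy([1, 1, 0, 0], [1, 0, 0, 0]))
```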
And this is good.
But it's not really accurate.
Because especially if you're using time sensitive data, this is a technique called cross validation
which doesn't really work very well for that.
So what I did to actually train it was, okay, let's look at the following day.
So if I have ‑‑ and this is the point, right?
I did all this calculation for today.
Who are the guys who are most likely to be attacking me tomorrow based on all this data
that I got?
And I would get something from 75 to 85%.
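The following-day check can be sketched as a loop that fits on one day and scores on the next one. The `fit` callback is a stand-in for whatever model is used (it is not the talk's code), and the days in the input are assumed to be consecutive.

```python
from datetime import date
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]            # (features, label)
Predictor = Callable[[Sequence[float]], int]


def next_day_accuracy(
    rows_by_day: Dict[date, List[Row]],
    fit: Callable[[List[Row]], Predictor],
) -> Dict[date, float]:
    """Train on each day and report accuracy on the day that follows it."""
    days = sorted(rows_by_day)
    results: Dict[date, float] = {}
    for train_day, test_day in zip(days, days[1:]):
        test_rows = rows_by_day[test_day]
        if not test_rows:
            continue  # nothing to score on this day
        predict = fit(rows_by_day[train_day])
        hits = sum(1 for features, label in test_rows if predict(features) == label)
        results[test_day] = hits / len(test_rows)
    return results
```

Unlike shuffled cross validation, the model never sees rows from the day it is scored on or from any later day.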
And I'm going to break that down for you so that you guys can really understand what 75% ‑‑
79. 79 to 95%.
Man, this is hard.
Anyway.
Just an idea of the progression, okay?
You have to run ‑‑ of course, if you have a model from February and you try to run
it on something that happened in July, you're going to have a terrible time because a lot
of things have happened since then.
And it also illustrates a little bit the moving about of the environment.
And blah, blah, blah.
Here's the point I'm trying to make.
Yeah, it's the same shit.
So here's the point I'm trying to make, okay?
It's 79 to 95.
Okay.
What's the bad stuff I got as bad, and what's the good stuff I got as good?
So it's true positive and true negative, okay?
So the numbers are a little slow.
You see that the true negative was really bringing it up.
And I think once I get different data, it should reduce a little bit, should even it
out.
The point I'm trying to make here is that if you calculate this, given this error rate,
something that this picks up in your logs and tells you to look at is
about 13 to 18 times more likely to be attacking you than all the rest, okay?
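A common way to turn a true positive rate and a true negative rate into a "times more likely" figure is the positive likelihood ratio. That this is the exact calculation behind the 13 to 18 figure is an assumption, and the rates in the example are made up for illustration.

```python
def positive_likelihood_ratio(true_positive_rate: float, true_negative_rate: float) -> float:
    """How much more often the model flags a real attacker than a harmless source."""
    false_positive_rate = 1.0 - true_negative_rate
    if false_positive_rate <= 0.0:
        raise ValueError("true_negative_rate must be below 1.0")
    return true_positive_rate / false_positive_rate


if __name__ == "__main__":
    # Hypothetical rates, not numbers from the talk.
    print(positive_likelihood_ratio(true_positive_rate=0.70, true_negative_rate=0.95))
```

The ratio is driven mostly by the true negative rate: a small drop in it can roughly halve the result, which is why that number matters so much in the breakdown.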
So the point here is that wouldn't you have your analysts look at that first?
I'm not saying ‑‑ this is not catching everything.
And neither are the analysts.
You know, spoiler alert.
But you have time constraints, okay, and there are only so many people, only so much time.
They have to eat.
They have to sleep.
I know.
It's terrible, right?
Labor laws.
But let's try to make this a little bit easier, right?
And this is where this is coming from, okay?
So this is an idea of prediction, okay?
So I'm just sampling ‑‑ it's the same curve you saw before.
I'm sampling 100K IP addresses from each point.
And just making sure ‑‑ so the brighter the tile, the more likely it is for people
to be attacking you from that, okay?
And this is a logarithmic scale.
So it can go like 10 to 1,000, things like that.
Anyway, these are the old guys.
You see, it's everywhere.
Anyway.
Challenges.
IP addresses are bad.
They're the worst stuff you can have to try to do real incident response.
But it's just a point I'm trying to prove here.
When you have pretty shitty stuff, you can actually get some interesting results.
But I mean, anonymous proxy okay, it holds well but Tor, there's not a lot of clustering.
If you're coming from Tor it starts messing it up a little bit.
If you're just changing your IP address every 30 minutes, fuck you.
But I believe that if you can reduce the cycles, so this is a daily resolution, you could
start getting smarter about that.
But then I wouldn't just go on IP addresses.
I would go on different stuff.
So, anyway, where am I trying to take this? As it is, it could potentially help
security analysts. Like I said, it brings a new priority dimension to SOC work.
You want to do the, like, okay, these are the most important assets, but these are the
guys who are most likely to be able to get me. So, yeah, let's see who are the guys here
and let's invest some time in that. What I think is really cool is that I created a
model for firewalls, right? I could do exactly the same thing for IPS. I could do exactly
the same thing for WAFs. And if I start taking the inference that each of these individual
guys is giving me on these IP addresses and I combine them, I could come to you and say,
okay, this guy is 200 times more likely to attack you. I would block this fucker,
you know? So, and this is the kind of confidence we're trying to build. And I don't know, blocking
is a very bad word in information security, especially if you're defending, but maybe
we can get to that confidence. And the way I'm trying to do this, there's
actually a project where I'm pretty much begging for data. You send me data, I'll
send you reports based on what the model is sending back. And the point is the more data
I have, the more I can start fighting the bias that I got from the SANS database. Okay? So there's
some URLs, there's a Twitter feed. If you look at ‑‑ if you look it up on Google,
it might come up. I'm not sure. But anyway, I'll be around. Anyway, takeaways. Machine
learning is cool. And it can help. Okay? It's not a monster. Of course, there is marketing
hype and stuff, but it can really help. And I think 13 to 18 is pretty good. Thanks.
